{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prepare Dataset for Model Training and Evaluating"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Amazon Customer Reviews Dataset\n",
    "\n",
    "https://s3.amazonaws.com/amazon-reviews-pds/readme.html\n",
    "\n",
    "## Schema\n",
    "\n",
    "- `marketplace`: 2-letter country code (in this case all \"US\").\n",
    "- `customer_id`: Random identifier that can be used to aggregate reviews written by a single author.\n",
    "- `review_id`: A unique ID for the review.\n",
    "- `product_id`: The Amazon Standard Identification Number (ASIN).  `http://www.amazon.com/dp/<ASIN>` links to the product's detail page.\n",
    "- `product_parent`: The parent of that ASIN.  Multiple ASINs (color or format variations of the same product) can roll up into a single parent.\n",
    "- `product_title`: Title description of the product.\n",
    "- `product_category`: Broad product category that can be used to group reviews (in this case digital videos).\n",
    "- `star_rating`: The review's rating (1 to 5 stars).\n",
    "- `helpful_votes`: Number of helpful votes for the review.\n",
    "- `total_votes`: Number of total votes the review received.\n",
    "- `vine`: Was the review written as part of the [Vine](https://www.amazon.com/gp/vine/help) program?\n",
    "- `verified_purchase`: Was the review from a verified purchase?\n",
    "- `review_headline`: The title of the review itself.\n",
    "- `review_body`: The text of the review.\n",
    "- `review_date`: The date the review was written.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q scikit-learn==0.20.3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import sagemaker\n",
    "import pandas as pd\n",
    "\n",
    "sess   = sagemaker.Session()\n",
    "bucket = sess.default_bucket()\n",
    "role = sagemaker.get_execution_role()\n",
    "region = boto3.Session().region_name"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download\n",
    "\n",
    "Let's start by retrieving a subset of the Amazon Customer Reviews dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "download: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz to data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz\n"
     ]
    }
   ],
   "source": [
    "!aws s3 cp 's3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz' ./data/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(102084, 15)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import csv\n",
    "\n",
    "df = pd.read_csv('./data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz', \n",
    "                 delimiter='\\t', \n",
    "                 quoting=csv.QUOTE_NONE,\n",
    "                 compression='gzip')\n",
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>marketplace</th>\n",
       "      <th>customer_id</th>\n",
       "      <th>review_id</th>\n",
       "      <th>product_id</th>\n",
       "      <th>product_parent</th>\n",
       "      <th>product_title</th>\n",
       "      <th>product_category</th>\n",
       "      <th>star_rating</th>\n",
       "      <th>helpful_votes</th>\n",
       "      <th>total_votes</th>\n",
       "      <th>vine</th>\n",
       "      <th>verified_purchase</th>\n",
       "      <th>review_headline</th>\n",
       "      <th>review_body</th>\n",
       "      <th>review_date</th>\n",
       "      <th>is_positive_sentiment</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>US</td>\n",
       "      <td>17747349</td>\n",
       "      <td>R2EI7QLPK4LF7U</td>\n",
       "      <td>B00U7LCE6A</td>\n",
       "      <td>106182406</td>\n",
       "      <td>CCleaner Free [Download]</td>\n",
       "      <td>Digital_Software</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "      <td>Y</td>\n",
       "      <td>Four Stars</td>\n",
       "      <td>So far so good</td>\n",
       "      <td>2015-08-31</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>US</td>\n",
       "      <td>10956619</td>\n",
       "      <td>R1W5OMFK1Q3I3O</td>\n",
       "      <td>B00HRJMOM4</td>\n",
       "      <td>162269768</td>\n",
       "      <td>ResumeMaker Professional Deluxe 18</td>\n",
       "      <td>Digital_Software</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "      <td>Y</td>\n",
       "      <td>Three Stars</td>\n",
       "      <td>Needs a little more work.....</td>\n",
       "      <td>2015-08-31</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>US</td>\n",
       "      <td>13132245</td>\n",
       "      <td>RPZWSYWRP92GI</td>\n",
       "      <td>B00P31G9PQ</td>\n",
       "      <td>831433899</td>\n",
       "      <td>Amazon Drive Desktop [PC]</td>\n",
       "      <td>Digital_Software</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>N</td>\n",
       "      <td>Y</td>\n",
       "      <td>One Star</td>\n",
       "      <td>Please cancel.</td>\n",
       "      <td>2015-08-31</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>US</td>\n",
       "      <td>35717248</td>\n",
       "      <td>R2WQWM04XHD9US</td>\n",
       "      <td>B00FGDEPDY</td>\n",
       "      <td>991059534</td>\n",
       "      <td>Norton Internet Security 1 User 3 Licenses</td>\n",
       "      <td>Digital_Software</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "      <td>Y</td>\n",
       "      <td>Works as Expected!</td>\n",
       "      <td>Works as Expected!</td>\n",
       "      <td>2015-08-31</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>US</td>\n",
       "      <td>17710652</td>\n",
       "      <td>R1WSPK2RA2PDEF</td>\n",
       "      <td>B00FZ0FK0U</td>\n",
       "      <td>574904556</td>\n",
       "      <td>SecureAnywhere Intermet Security Complete 5 De...</td>\n",
       "      <td>Digital_Software</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>N</td>\n",
       "      <td>Y</td>\n",
       "      <td>Great antivirus. Worthless customer support</td>\n",
       "      <td>I've had Webroot for a few years. It expired a...</td>\n",
       "      <td>2015-08-31</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  marketplace  customer_id       review_id  product_id  product_parent  \\\n",
       "0          US     17747349  R2EI7QLPK4LF7U  B00U7LCE6A       106182406   \n",
       "1          US     10956619  R1W5OMFK1Q3I3O  B00HRJMOM4       162269768   \n",
       "2          US     13132245   RPZWSYWRP92GI  B00P31G9PQ       831433899   \n",
       "3          US     35717248  R2WQWM04XHD9US  B00FGDEPDY       991059534   \n",
       "4          US     17710652  R1WSPK2RA2PDEF  B00FZ0FK0U       574904556   \n",
       "\n",
       "                                       product_title  product_category  \\\n",
       "0                           CCleaner Free [Download]  Digital_Software   \n",
       "1                 ResumeMaker Professional Deluxe 18  Digital_Software   \n",
       "2                          Amazon Drive Desktop [PC]  Digital_Software   \n",
       "3         Norton Internet Security 1 User 3 Licenses  Digital_Software   \n",
       "4  SecureAnywhere Intermet Security Complete 5 De...  Digital_Software   \n",
       "\n",
       "   star_rating  helpful_votes  total_votes vine verified_purchase  \\\n",
       "0            4              0            0    N                 Y   \n",
       "1            3              0            0    N                 Y   \n",
       "2            1              1            2    N                 Y   \n",
       "3            5              0            0    N                 Y   \n",
       "4            4              1            2    N                 Y   \n",
       "\n",
       "                               review_headline  \\\n",
       "0                                   Four Stars   \n",
       "1                                  Three Stars   \n",
       "2                                     One Star   \n",
       "3                           Works as Expected!   \n",
       "4  Great antivirus. Worthless customer support   \n",
       "\n",
       "                                         review_body review_date  \\\n",
       "0                                     So far so good  2015-08-31   \n",
       "1                      Needs a little more work.....  2015-08-31   \n",
       "2                                     Please cancel.  2015-08-31   \n",
       "3                                 Works as Expected!  2015-08-31   \n",
       "4  I've had Webroot for a few years. It expired a...  2015-08-31   \n",
       "\n",
       "   is_positive_sentiment  \n",
       "0                      1  \n",
       "1                      0  \n",
       "2                      0  \n",
       "3                      1  \n",
       "4                      1  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Enrich the data with `is_positive_sentiment` label\n",
    "* Positive (`1`):  `star_rating >= 4`\n",
    "* Negative (`0`) :  `star_rating <= 3`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>marketplace</th>\n",
       "      <th>customer_id</th>\n",
       "      <th>review_id</th>\n",
       "      <th>product_id</th>\n",
       "      <th>product_parent</th>\n",
       "      <th>product_title</th>\n",
       "      <th>product_category</th>\n",
       "      <th>star_rating</th>\n",
       "      <th>helpful_votes</th>\n",
       "      <th>total_votes</th>\n",
       "      <th>vine</th>\n",
       "      <th>verified_purchase</th>\n",
       "      <th>review_headline</th>\n",
       "      <th>review_body</th>\n",
       "      <th>review_date</th>\n",
       "      <th>is_positive_sentiment</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>US</td>\n",
       "      <td>17747349</td>\n",
       "      <td>R2EI7QLPK4LF7U</td>\n",
       "      <td>B00U7LCE6A</td>\n",
       "      <td>106182406</td>\n",
       "      <td>CCleaner Free [Download]</td>\n",
       "      <td>Digital_Software</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "      <td>Y</td>\n",
       "      <td>Four Stars</td>\n",
       "      <td>So far so good</td>\n",
       "      <td>2015-08-31</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>US</td>\n",
       "      <td>10956619</td>\n",
       "      <td>R1W5OMFK1Q3I3O</td>\n",
       "      <td>B00HRJMOM4</td>\n",
       "      <td>162269768</td>\n",
       "      <td>ResumeMaker Professional Deluxe 18</td>\n",
       "      <td>Digital_Software</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "      <td>Y</td>\n",
       "      <td>Three Stars</td>\n",
       "      <td>Needs a little more work.....</td>\n",
       "      <td>2015-08-31</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>US</td>\n",
       "      <td>13132245</td>\n",
       "      <td>RPZWSYWRP92GI</td>\n",
       "      <td>B00P31G9PQ</td>\n",
       "      <td>831433899</td>\n",
       "      <td>Amazon Drive Desktop [PC]</td>\n",
       "      <td>Digital_Software</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>N</td>\n",
       "      <td>Y</td>\n",
       "      <td>One Star</td>\n",
       "      <td>Please cancel.</td>\n",
       "      <td>2015-08-31</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>US</td>\n",
       "      <td>35717248</td>\n",
       "      <td>R2WQWM04XHD9US</td>\n",
       "      <td>B00FGDEPDY</td>\n",
       "      <td>991059534</td>\n",
       "      <td>Norton Internet Security 1 User 3 Licenses</td>\n",
       "      <td>Digital_Software</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "      <td>Y</td>\n",
       "      <td>Works as Expected!</td>\n",
       "      <td>Works as Expected!</td>\n",
       "      <td>2015-08-31</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>US</td>\n",
       "      <td>17710652</td>\n",
       "      <td>R1WSPK2RA2PDEF</td>\n",
       "      <td>B00FZ0FK0U</td>\n",
       "      <td>574904556</td>\n",
       "      <td>SecureAnywhere Intermet Security Complete 5 De...</td>\n",
       "      <td>Digital_Software</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>N</td>\n",
       "      <td>Y</td>\n",
       "      <td>Great antivirus. Worthless customer support</td>\n",
       "      <td>I've had Webroot for a few years. It expired a...</td>\n",
       "      <td>2015-08-31</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  marketplace  customer_id       review_id  product_id  product_parent  \\\n",
       "0          US     17747349  R2EI7QLPK4LF7U  B00U7LCE6A       106182406   \n",
       "1          US     10956619  R1W5OMFK1Q3I3O  B00HRJMOM4       162269768   \n",
       "2          US     13132245   RPZWSYWRP92GI  B00P31G9PQ       831433899   \n",
       "3          US     35717248  R2WQWM04XHD9US  B00FGDEPDY       991059534   \n",
       "4          US     17710652  R1WSPK2RA2PDEF  B00FZ0FK0U       574904556   \n",
       "\n",
       "                                       product_title  product_category  \\\n",
       "0                           CCleaner Free [Download]  Digital_Software   \n",
       "1                 ResumeMaker Professional Deluxe 18  Digital_Software   \n",
       "2                          Amazon Drive Desktop [PC]  Digital_Software   \n",
       "3         Norton Internet Security 1 User 3 Licenses  Digital_Software   \n",
       "4  SecureAnywhere Intermet Security Complete 5 De...  Digital_Software   \n",
       "\n",
       "   star_rating  helpful_votes  total_votes vine verified_purchase  \\\n",
       "0            4              0            0    N                 Y   \n",
       "1            3              0            0    N                 Y   \n",
       "2            1              1            2    N                 Y   \n",
       "3            5              0            0    N                 Y   \n",
       "4            4              1            2    N                 Y   \n",
       "\n",
       "                               review_headline  \\\n",
       "0                                   Four Stars   \n",
       "1                                  Three Stars   \n",
       "2                                     One Star   \n",
       "3                           Works as Expected!   \n",
       "4  Great antivirus. Worthless customer support   \n",
       "\n",
       "                                         review_body review_date  \\\n",
       "0                                     So far so good  2015-08-31   \n",
       "1                      Needs a little more work.....  2015-08-31   \n",
       "2                                     Please cancel.  2015-08-31   \n",
       "3                                 Works as Expected!  2015-08-31   \n",
       "4  I've had Webroot for a few years. It expired a...  2015-08-31   \n",
       "\n",
       "   is_positive_sentiment  \n",
       "0                      1  \n",
       "1                      0  \n",
       "2                      0  \n",
       "3                      1  \n",
       "4                      1  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['is_positive_sentiment'] = (df['star_rating'] >= 4).astype(int)\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "df_train, df_test = train_test_split(df, test_size=0.10, stratify=df['is_positive_sentiment'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_path = 'amazon_reviews_us_Digital_Software_v1_00_train.csv'\n",
    "\n",
    "df_train.to_csv(train_path, index=False, header=True)\n",
    "\n",
    "train_s3_prefix = 'data'\n",
    "\n",
    "train_s3_uri = sess.upload_data(path=train_path, key_prefix=train_s3_prefix)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_path = 'amazon_reviews_us_Digital_Software_v1_00_test.csv'\n",
    "\n",
    "df_test.to_csv(test_path, index=False, header=True)\n",
    "\n",
    "test_s3_prefix = 'data'\n",
    "\n",
    "test_s3_uri = sess.upload_data(path=test_path, key_prefix=test_s3_prefix)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Compare Positive to Negative Sentiment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "AutoPilot will automatically balance the data during feature engineering, so we don't need to manually balance.\n",
    "\n",
    "_Note:  You may need to run this next cell twice to see the chart._\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "\n",
    "sns.countplot(x='is_positive_sentiment', data=df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "is_positive_sentiment_count = len(df.query('is_positive_sentiment == 1'))\n",
    "is_negative_sentiment_count = len(df.query('is_positive_sentiment == 0'))\n",
    "\n",
    "print('Positive count: {}'.format(is_positive_sentiment_count))\n",
    "print('Negative count: {}'.format(is_negative_sentiment_count))\n",
    "print('Ratio of Positive to Negative: {}'.format(is_positive_sentiment_count / is_negative_sentiment_count))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reduce the dataset to just `is_positive_sentiment` and `review_body`\n",
    "For now, we will only train the model with `review_body` feature and the `is_positive_sentiment` target."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df[['is_positive_sentiment', 'review_body']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Split the data into `train` and `test` datasets\n",
    "\n",
    "Split into `90% train` data and `10% test` data using `is_positive_sentiment` to stratify the split.\n",
    "\n",
    "Note that AutoPilot will automatically split the train data into `train` and `validation` datasets, so we only need to preserve `10% test` dataset on our end.\n",
    "\n",
    "Also note that TF/IDF requires us to split before we generate the TF/IDF embeddings - otherwise, test and validation data will leak into the train dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "df_train, df_test = train_test_split(df, test_size=0.10, stratify=df['is_positive_sentiment'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Show the split details"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('df_train.shape: {}'.format(df_train.shape))\n",
    "print('df_test.shape: {}'.format(df_test.shape))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_test.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Save the the `Train` Dataset Locally and Upload  to S3 for AutoPilot\n",
    "_Note:  AutoPilot requires a header, so we use `header=True`._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "train_path = 'data/train.csv'\n",
    "df_train.to_csv(train_path, index=False, header=True)\n",
    "\n",
    "train_s3_prefix = 'data'\n",
    "\n",
    "train_s3_uri = sess.upload_data(path=train_path, key_prefix=train_s3_prefix)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(train_s3_uri)\n",
    "\n",
    "!aws s3 ls $train_s3_uri"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Store the location of our train data in our notebook server to be used next"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store train_s3_uri"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Save the `Test` Dataset Locally to Use Later to Evaluate the AutoPilot Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_path = 'data/test.csv'\n",
    "\n",
    "df_test.to_csv(test_path, index=False, header=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Summary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## We have upload our `train` dataset to S3 to be used next"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(train_s3_uri)\n",
    "\n",
    "!aws s3 ls $train_s3_uri"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## We have saved the S3 location to our `train` dataset in Jupyter to be used later"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store -r train_s3_uri\n",
    "\n",
    "print(train_s3_uri)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## We have our local `train` dataset to be used later"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls -al ./data/train.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## We have our local `test` dataset to be used later"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls -al ./data/test.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
